Summary

This analysis assesses the extent to which coding sequences are depleted in ATGs. It extends to a more general, metagene-style analysis of codon frequencies by position and frame. It uses the best-transcript annotation and the corresponding map of start codon positions and sequences made by Corinne Maufrais in June 2018.

Load Packages

Load CDS

In-frame ATG codons are depleted near the start, but other codons are more depleted.

This is smoothed to make it readable.
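
The smoothing method is not stated in these notes. As a hedged illustration only, a centered running mean over per-position frequencies is one simple way to get a readable curve. The function name `running_mean` and the default width of 9 are assumptions for this sketch, and it is written in Python rather than the R used for the analysis.

```python
def running_mean(values, width=9):
    """Centered running mean; windows are truncated at both ends of the series.

    Assumption: this is a stand-in for whatever smoother the plots used.
    """
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        # Slice is clipped at the edges, so edge windows are shorter
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Because edge windows are shorter, the first few positions (where the start-codon signal is strongest) are averaged over fewer points and remain noisier than the interior.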

Other in-frame codons have very different depletion/enrichment

What are the non-ATG codons depleted near start?

GAA, GAT, GAG. Why?

GAA, GAT, GAG, and TGT are more depleted near the start than ATG, and CTC is strange

But not GAC. Why are both E (Glu) codons depleted, but only one of the two D (Asp) codons?

TCN, GCC, and CCC are enriched near the start

Mostly serine. Is that the N-end rule for degradation at work or a translation initiation phenomenon?

Does depletion differ depending on the frame?

## # A tibble: 421,302 x 5
##      Pos Codon     n    Freq Frame
##    <int> <fct> <int>   <dbl> <fct>
##  1     1 ATG    6634 1       0    
##  2     2 AAA      49 0.00739 0    
##  3     2 AAC      86 0.0130  0    
##  4     2 AAG      87 0.0131  0    
##  5     2 AAT      28 0.00422 0    
##  6     2 ACA     130 0.0196  0    
##  7     2 ACC     146 0.0220  0    
##  8     2 ACG      51 0.00769 0    
##  9     2 ACT     127 0.0191  0    
## 10     2 AGA      51 0.00769 0    
## # ... with 421,292 more rows
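
A table with these columns can be built by reading each CDS in all three frames and tallying codons by position. The sketch below is a minimal Python illustration, not the R code that produced the output above. The function name `codon_freq_by_frame` is hypothetical, and it assumes that frame 0 starts at the annotated ATG and that frames 1 and 2 read the same sequence shifted by one and two nucleotides.

```python
from collections import Counter

def codon_freq_by_frame(cds_seqs, frames=(0, 1, 2)):
    """Tally codons by frame and codon position across CDS sequences.

    Assumed convention: position 1 in frame 0 is the start codon.
    Returns rows of (Pos, Codon, n, Freq, Frame), where Freq is the share
    of that codon among all codons seen at that position and frame.
    """
    counts = Counter()
    totals = Counter()
    for seq in cds_seqs:
        seq = seq.upper()
        for frame in frames:
            # Step through complete triplets only
            for pos, i in enumerate(range(frame, len(seq) - 2, 3), start=1):
                codon = seq[i:i + 3]
                counts[(frame, pos, codon)] += 1
                totals[(frame, pos)] += 1
    rows = []
    for (frame, pos, codon), n in sorted(counts.items()):
        rows.append((pos, codon, n, n / totals[(frame, pos)], frame))
    return rows
```

Under this convention every sequence contributes ATG at position 1 in frame 0, which matches the first row of the table (Freq of 1).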

Depletion of ATG codons is wildly different depending on frame

Not smoothed

Smoothed

Depletion/enrichment of many codons is wildly different depending on frame

Also not smoothed.

ATG, and indeed NTG codons in general, are strongly depleted in frame 2. This is consistent with the depletion of TGN codons in frame 0, which reflects avoidance of premature termination (TGA) and the rarity of the amino acids cysteine (TGT, TGC) and tryptophan (TGG).

ATG and some other ATN codons are depleted in frame 1. Most NNG codons are depleted near the start in frame 1, as are many GNN codons in frame 0 (GAA, GAG, GAT, GGA, GGG, GGT). In fact, G-rich codons seem to be depleted near the start relative to C-rich codons.

There is so much data here it is difficult to judge.

Does depletion of ATG codons depend on aATG score?

## # A tibble: 713,522 x 6
##      Pos Codon     n    Freq Frame ascore
##    <int> <fct> <int>   <dbl> <fct> <chr> 
##  1     1 ATG    4708 1       0     hi    
##  2     2 AAA      27 0.00573 0     hi    
##  3     2 AAC      63 0.0134  0     hi    
##  4     2 AAG      58 0.0123  0     hi    
##  5     2 AAT      17 0.00361 0     hi    
##  6     2 ACA      93 0.0198  0     hi    
##  7     2 ACC     107 0.0227  0     hi    
##  8     2 ACG      33 0.00701 0     hi    
##  9     2 ACT      85 0.0181  0     hi    
## 10     2 AGA      36 0.00765 0     hi    
## # ... with 713,512 more rows

Not really. Here "hi" means a narrow score of at least 0.85.

Normalized in-frame to out-of-frame ratio, per Malabat et al.

Malabat et al. compare the change in in-frame versus out-of-frame depletion as follows.

For each codon:

  1. Take the ratio of total in-frame to out-of-frame counts over codons 500 to 1000.
  2. Take the ratio of in-frame to out-of-frame counts over 9-codon windows.
  3. Normalize the latter by the former in log2 space.
  4. Plot the smoothed, normalized log2 ratios for each codon as a function of position.
  5. Compare this for genes with and without internal transcription start sites (iTSS).

EW implemented this, using fixed-width 10-codon windows instead of the smoothed windows from Malabat et al.
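
The fixed-window variant can be sketched as follows. This is a Python illustration under stated assumptions, not EW's implementation: the function name `normalized_log2_ratio` is hypothetical, "out of frame" is taken to mean frames 1 and 2 pooled, and the `pseudocount` (to avoid dividing by zero in sparse windows) is an addition not described above. The input `counts` maps `(frame, pos, codon)` to a count.

```python
import math

def normalized_log2_ratio(counts, codon, window=10, ref_range=(500, 1000),
                          max_pos=500, pseudocount=0.5):
    """Return {window_start: log2(window in:out ratio / reference in:out ratio)}.

    Assumptions: frame 0 is in-frame; frames 1 and 2 are pooled as out-of-frame;
    a pseudocount is added to both counts before taking each ratio.
    """
    def in_out_ratio(lo, hi):
        in_frame = out_frame = 0
        for (frame, pos, cod), n in counts.items():
            if cod == codon and lo <= pos <= hi:
                if frame == 0:
                    in_frame += n
                else:
                    out_frame += n
        return (in_frame + pseudocount) / (out_frame + pseudocount)

    # Step 1: reference ratio far from the start codon
    reference = in_out_ratio(*ref_range)
    result = {}
    # Steps 2 and 3: fixed-width windows, normalized in log2 space
    for start in range(1, max_pos + 1, window):
        ratio = in_out_ratio(start, start + window - 1)
        result[start] = math.log2(ratio / reference)
    return result
```

A negative value means the codon is more depleted in-frame, relative to out-of-frame, in that window than it is in the reference region. Rescanning `counts` for each window is slow but keeps the sketch short.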

Are good-context ATGs more depleted?

We need to count ATGs with good versus bad context. Maybe just an A at -3 would work to give an initial picture? Plot levels of TNNATG, CNNATG, ANNATG, and GNNATG?

## # A tibble: 1,853 x 3
## # Groups:   Pos [585]
##      Pos sixmer     n
##    <int> <chr>  <int>
##  1     1 ATGAAG     1
##  2     1 ATGAGC     1
##  3     1 ATGCCC     1
##  4     1 ATGCCT     1
##  5     1 ATGGCA     1
##  6     1 ATGGTT     1
##  7     2 AACTCC     1
##  8     2 ACCTAC     1
##  9     2 CAGTAT     1
## 10     2 CCAGCT     1
## # ... with 1,843 more rows
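
The proposed first pass, grouping ATGs by the nucleotide at -3 (the first N of NNNATG), can be sketched as below. This Python sketch does not reproduce the sixmer table above. The function name `atg_by_minus3` is hypothetical, and it assumes that only ATGs downstream of the annotated start are of interest, so that the -3 nucleotide always lies inside the CDS.

```python
from collections import Counter

def atg_by_minus3(cds_seqs, frame=0):
    """Count ATG codons in the given frame by (codon position, -3 nucleotide).

    Assumption: scanning begins at the second codon of the frame, which skips
    the annotated start codon in frame 0 and guarantees that seq[i - 3] exists.
    """
    counts = Counter()
    for seq in cds_seqs:
        seq = seq.upper()
        for pos, i in enumerate(range(frame + 3, len(seq) - 2, 3), start=2):
            if seq[i:i + 3] == "ATG":
                # With A of ATG at +1, the -3 nucleotide sits three bases upstream
                counts[(pos, seq[i - 3])] += 1
    return counts
```

Summing the counts over positions for each -3 nucleotide gives the ANNATG, CNNATG, GNNATG, and TNNATG levels to plot.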